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We introduce a procedure to infer the interactions among a set of binary variables, based on 
their sampled frequencies and pairwise correlations. The algorithm builds the clusters of variables 
contributing most to the entropy of the inferred Ising model, and rejects the small contributions 
due to the sampling noise. Our procedure successfully recovers benchmark Ising models even at 
criticality and in the low temperature phase, and is applied to neurobiological data. 



Understanding the correlated activity of complex, non- 
homogeneous multi-component systems is of fundamen- 
tal importance in physics, biology, sociology, finance, ... 
A natural issue is to separate direct correlations (due to 
direct interactions) from network-mediated correlations. 
The Ising model, of ubiquitous importance in statistical 
physics, provides a natural framework to extract inter- 
actions from correlations [H, and was recently used for 
the analysis of neurobiological data 0-131 • It is indeed 
the least constrained model capable of reproducing the 
individual and pairwise frequencies of a set of, say, N 
binary- valued variables, Ui = 0, 1. In practice, these fre- 
quencies. Pi and pij, are often estimated through empir- 
ical averages over a number of sampled configurations 
{ctJ, (Tj) ■ • ■ ) ^at}; b = I,. . . ,B. The task then consists in 
inferring the parameters (fields hi and interactions Jij) 
of the Ising model reproducing those data. From a math- 
ematical point of view, one has to solve the ^N{N + 1) 
implicit equations pi = (ai) and pij — {(JiCj) for the 
fields and interactions, where (•} denotes the Gibbs aver- 
age with Boltzmann factor exp ( J^i ^i'^i~^J2i<j Jij'^i^j)- 

Various approaches have been developed to solve the 
inverse Ising problem, called Boltzmann Machine (BM) 
in Machine Learning, includingBM learning [sj, mean 
field and message-passing 0^ methods, and pseudo- 
likelihood algorithms [9]. Despite their specificities, those 
methods have in common to be efficient when the corre- 
lations, Cij ~ Pij —piPj, are weak, and to perform badly 
when most pairs («, j) are strongly correlated, e.g. when 
the data are generated by a critical Ising model. Those 
examples seem to suggest that fast algorithms cannot in- 
fer BMs with long-range correlations fl^. 

However, the existence of a relationship between the 
presence of strong correlations in the 'direct' model and 
the intrinsic hardness of the inverse problem is ques- 
tionable jUl]. Let p = {pi,Pij}, J = {hi, Jij}, (cr) = 
{((Ti), (o-iO-j)} be the ^N{N + l)-dimensional vectors 
of, respectively, the measured frequencies, the interac- 
tion parameters and the Gibbs frequencies. We define 
the susceptibility and the inverse susceptibility matrices 
through, respectively. 



X is attached to the direct model, and quantifies how the 
frequencies respond to a small change in the interaction 
parameters. X~^i which gives the response of the BM 
interaction parameters to a small change in the frequen- 
cies, is a natural characterization for the inverse problem. 
An essential point, which has received little attention in 
the context of BM so far, is that is generally much 
sparser and shorter-range than x] evidence for this claim 
is reported below. Even if strong responses (and correla- 
tions) pervade the system, each BM interaction parame- 
ter may mostly depend on a small (compared to N) num- 
ber of frequencies p. Interestingly, the short-rangedness 
of x^^ makes the inference not only possible but also 
meaningful, as experiments generally probe limited parts 
of larger systems. 

In this letter, we present a method for inferring BM, 
exploiting this notion of limited dependence. The inter- 
action network is progressively unveiled, through a re- 
cursive processing of larger and larger subsets of vari- 
ables, which we call clusters. To each cluster F is asso- 
ciated an entropy AS'(F), which assesses how much the 
cluster is relevant to infer the BM. Clusters such that 
A5(F)| < Q, where 9 is a fixed threshold are discarded; 
the other clusters are kept and recursively used to gener- 
ate larger clusters. Threshold 6 must be large enough to 
avoid overfitting of the data corrupted by the sampling 
noise (finite B) and small enough in order not to miss 
important components of the interaction network. Con- 
trary to conventional cluster expansions 0, 01 , the num- 
ber, size, and composition of the clusters automatically 
adapt to the data, and, rather than the sole value of N, 
determine the running time of the algorithm. Pseudo- 
codes intended for the practical implementation of our 
algorithm are given in Supplemental Material jl2| . 

Our starting point is the Legendre transform of the 
partition function Z(J) (sum of the Boltzmann factors) 
of the Ising model. 
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where • denotes the dot product; it is the cross entropy 
between the sampled distribution and the best BM or, 
equivalently, the negative of the maximum log-likelihood 
of the parameters J given the data p [13|. Let us define 
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'S'o(p) = 5logdet(cij), where c^j = Cy /[pj(l - pi)pj{l 
Pj)]2 . We now formally write, for given p, 
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(3) 

where the sums run over every subset (cluster) of the N 
variables. The choice of expanding S — Sq rather than 
S will be explained later. According to ([3]) for = 1, 
AS{i) is the entropy of a single spin with average value 
Pi. Using dS]) again for N = 2, we find that AS{i,j) 
equals the loss in entropy when imposing the constraint 
{(JiCj) = pij to a system of 2 spins with fixed magne- 
tizations, {(Ji) = Pi, (cTj) = pj, minus the contribution 
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coming from 5*0. A recursive use of ^ 



and ([3]) for increasing N allows us to calculate AS'(r) for 
larger and larger clusters F = (ii, 12, . . . , ix)- The maxi- 
mal cluster size, say. A' = 20, is set by the computational 
hardness of obtaining S from ([2]). Note that A5'(r) is a 
function of the individual and pairwise frequencies of the 
spins in F only. 

To illustrate the properties of the cluster expansion ([3]) 
consider the 2D-Ising model on a Af x M grid (Fig.[T]), in 
the absence of sampling noise (B — 00). Enumerations 
of the 2*^ spin configurations allow us to calculate the 
frequencies p = (a) and the cluster-entropies AS'(F) ex- 
actly for small values of M (Fig. [T]A_). The entropy of the 
clusters F decreases exponentially with the length A(F) 
of the shortest closed interaction path joining the spins 
in F, e.g. A(l,2) = 2, i(l,3,6) = 6. The entropies 
of clusters sharing a common interaction path (and the 
same L) have alternating signs, depending on the parity 
of the cluster size (Fig. [T]4.); their sum is much smaller 
(in absolute value) than any cluster-entropy taken sep- 



arately 2l| . Figure [TJ3 shows the error eg on the en- 



tropy, when all cluster-entropies smaller than 9 are dis- 
carded. £5 exhibits lower and lower plateaus, separated 
by higher barriers as the threshold & is decreased. The 
first low plateau, eg ~ .002, takes place at Ql = .012, 
when all nearest-neighbor clusters (L = 2) are selected. 
The second and lower plateau, es — 5 10~^, is reached 
for 62 — 0.002, after all clusters with L — 4 are taken 
into account. Barriers in between plateaus correspond 
to values of Q, for which the truncation interrupts the 
summation (and partial cancellation) of all the clusters 
sharing an interaction path; the error on the entropy is 
then €s ~ O. 

Let us turn to the case of imperfect sampling (finite 
B). The measured correlations, Cij — pij ^ piPj, differ 
from the Gibbs correlations, [aiUj) — {<yi){<Jj), by random 
fiuctuations of amplitude v = 0{B^2y Those fluctua- 
tions do not affect much the largest correlations and the 
largest cluster-entropies. However, for the pairs i,j with 
weak Gibbs correlations {< v in absolute value), the mea- 
sured correlations are dominated by the noise. This fact 
has two consequences. First, the norm of the 2-point 
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FIG. 1: Exact cluster enumeration for a 3 x 3 grid, with cou- 
pling J = 1.778 (units of fc^T), corresponding to the critical 
value for an infinite grid [Ij. A. Cluster-entropies AS'(r) 
vs. length L(r) of the interaction path for perfect sampling; 
representative F are shown for L = 2, 4 (labels refer to the 
grid). B. error es vs. O for perfect sampling {B = 00, full 
curve), and two random samples with B — 10^ (dashed) and 
B = 4500 (dotted curve) configurations. The accuracy on es 
and each AS is ~ 10"^^. 
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FIG. 2: Histograms of AS'(ji, 12, is) for the ID-Ising (J — 
4, ft = —5, units of ksT) and Independent Spin (IS) models, 
for A = 50 and three values of B. The distributions collapse 
onto each other after rescaling by the standard deviation of 

the IS cluster-entropies, ASis{B) ~ ^£-^(|)f + O(^). 
Universality at small AS holds for larger cluster sizes {— 3 
here). Large- AS tails are not universal (not shown), and are 
specific to the interaction network, see Fig. [T}\. 
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susceptibility, \x2\ ^ — 



Nv is extensive: over- 



fitting makes the inferred Ising model look like critical. 
Secondly, the distribution of the cluster entropies is uni- 
versal for AS' — i> and N ^ 00: it coincides with the 
distribution for a system of Independent Spins, with the 
same piS as the original system, and the same number 
B of sampled configurations (Fig. [2]). The presence of 
this universal, noisy peak justifies the introduction of a 
threshold 8 and sets a lower bound to its value. Fig- 
ure [TJ3 shows that the error eg behaves as in the perfect 
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^5*0, are our approximations for 



sampling case for large G and saturates at low Q as ex- 
pected. Again, the entropy is accurately estimated by 
taking into account only the top cluster-entropies, asso- 
ciated to the dominant interaction paths on the lattice. 

Systematic enumeration of clusters is not possible for 
large systems. The example above suggests a fast, recur- 
sive procedure to build up clusters of increasing sizes, 
whose principle is based on the existence of paths of 
strong interactions connecting the spins. First we cal- 
culate the entropies associated to the N clusters with 
K = 1 spin. Then, two clusters Fi and F2 of size K can 
be merged to give birth to the cluster F = Fi U F2 of size 
K + 1 if Fi and F2 have exactly K ~1 common spins, and 
if |AS'(F)| > 9. The underlying principle is, again, that 
the building-up prescription should be compatible with 
the existence of a path of strong interactions connecting 
the spins, and that clusters with low entropies can be dis- 
carded. Each time a new cluster F is created and selected 
we store its contributions to the entropy, AS'(F) and to 
the interaction parameters, AJ(F) = — ^A5(F). The 
procedure naturally stops when no cluster of larger size 
can be built through the recursion. The sums of AS'(r) 
and AJ(F) over the selected clusters, added to, respec 
tively. So and Jo = 

the entropy and the interactions of the BM 

We now report the tests of the above inference al- 
gorithm on synthetic data generated from Ising models 
with known couplings. First we consider dilute ferro- 
magnets on 2D-grids of sizes AI x M; BM learning is 
hindered by the huge thermalization time at low temper- 
ature, mean-field and message-passing methods are not 
expected to be efficient on such loopy lattices and the 
Pseudo-Fikelihood (PF) algorithm of ^] fails outside the 
paramagnetic phase, even for M = 7 Our algo- 

rithm successfully retrieves the network of interactions 
at the critical point, in the low temperature phase, and 
for much larger sizes (Fig.[3K)- As e is lowered the error 
on Jij first decreases and then saturates to a value close 

to the Cramer-Rao bound, \J^X~^j [H (Fig. At 
the cross-over threshold the largest selected clusters have 
size 4, while ^ ^ M as the system is critical (Fig. [3j3). 
The running time of the algorithm (at the cross-over Q) 
is ^ 10 millisec on one core of an AMD Opteron dual- 
core processor at 3 Ghz. The inference algorithm is also 
applied to glassy frustrated Ising models [15'|, of various 
sizes N (Fig. [SjC). Performances do not seem to worsen 
as N increases. 

To better understand the saturation of the error and 
the quality of the inference we compare the difference 
i5p between the frequencies calculated from the inferred 
BM, (cr) [g^l, and the measures, p, to the fluctuations 
expected from the sampling of B configurations at equi- 
librium. The variance of these fluctuations are the di- 
agonal elements of %, divided by B. An estimate of 
the relative error for the one-site frequencies is thus 



1 

0.8 

■iO.6 
&-'o.4 
0.2 




□ 8=4500 
■ •B=45000 

ooPL, B=4500 




0.005 ^ 0.05 




0.02 ^ 0.2 



2 3 
J 







0.1 





FIG. 3: A. Fraction of non-zero interactions recognized by our 
procedure (squares) and by the PL algorithm (circles, from 
[101) vs. intensity J of couplings on a 7 x 7 grid with 30% 
dilution; similar performances are obtained for 20 x 20 grids. 
The critical coupling is Jc — 2.8 (units of fcsT) 14]. B. Error 
on the inferred couplings vs. Q for M x M grids at 'criticality' 
(Jc — 1.778, no dilution) and B = 4500; the dependence on 
A4 is reduced with periodic boundary conditions (not shown) . 
Inset; ec vs. O for M = 7. C. same as B for the Viana- 
Bray spin glass model [l^ (connectivity 5 and random Jij, 
uniform in [— j"; j"]; = 4 is larger than the spin glass 
critical coupling, J*' ~ 3.5 llSj). Inset: tc vs. Q ioi N = 50. 
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a similar expression can be writ- 
Values of e ^ 1 



ten for the error on the correlations, tc 
signal a poor inference, while overfitting corresponds to 
e ^ 1. This criterion is justified if the Gibbs fluctua- 
tions are comparable to the error bars that can be com- 
puted using statistical methods such as bootstrap. We 
find that ep and ec are close to 1 at the cross-over thresh- 
old for which the error on the couplings saturates (Insets 
of Fig. [3j3&C). Lowering Q further reduces ep,ec, but 
does not increase the accuracy on the interactions and is 
merely an overfitting of the data. 

The running time of our algorithm depends on the 
complexity of the underlying interaction network rather 
than on the system size. We analyze in Fig. 2] a 3180 
second-long recording of the retinal activity of a salaman- 
der, previously studied in Q using BM learning (TVi — 40 
cells). As is lowered, the number of selected clus- 
ters and their maximal size increase, and the entropy 
S reaches a plateau (Fig. H^&C). For 6* ~ 610"*^, the 
errors are e ~ 1, and the inferred Ising model reproduces 
the frequencies and correlations (Fig. |4jl)). We have also 
applied our algorithm to recordings of other neurobiolog- 
ical systems, including the cortical activity oi N2 — 37 
cells in a behaving rat [lH (not shown). While the am- 
plitudes of the interactions found with both data sets are 
similar, the maximal size of the selected clusters, which 
is a measure of the neighborhood of a cell on the interac- 
tion network, is much smaller for the cortical recording 
(= 3) than for the retinal activity {— 8). The lower com- 
plexity of the inferred network results in a lower CPU 
time {t2 — .1 sec vs. ii ~ 5 min on the computer above), 
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FIG. 4: Performance of the inference procedure as a function 
of the threshold Q. A. errors ep and tc\ B. number (bottom, 
left scale) and maximal size (top, right scale) of clusters; C. 
entropy 5(0) • D. reconstructed vs. experimental pi and Cij 
for 0* (error bars are calculated from x)- Spin values are 
o"i = 1 if cell i is active in a 20 msec-time bin, otherwise 
01 . Note that, while the correlations Cij are small, the ratios 
Cij / (piPj ) are of the order of unity, and so are the inferred 
couplings Jij. 



in spite of Ni ^ Ni. 

While expanding S alone in ^ would be possible, 
the cluster-entropies jA^rl produced by the expansion 
oi S — So are generally smaller [l7|. Therefore, less clus- 
ters are needed to achieve an accurate inference, and the 
fluctuations of es (Fig.[Tj3) and of Sp, Cc (Figs.[33&C and 
IDA.) are smaller, see discussion about barriers above. As 
Sn coincides with S for mean field models when TV — > oo 
y, it is a good starting point for the expansion even for 
systems with rather dense and weak interaction networks. 
In the case of severe undersampling, regularized versions 
of So including a penalty over the couplings based on the 
L2 (|I3) the Li [ig norm can be used. Note that 



(Jo), 



oc 



-(c is regular even at criticality, i.e. even 



if c has a diverging eigenvalue. 

Our work suggests that the BM problem can be solved 
efflciently even when data exhibit strong correlations. 
The contribution to due to a cluster F, 



c?pi9p 



is highly sparse since ASy depends on a few frequencies 
only. The success of our algorithm relies on the property 
that can be accurately approximated by such an ex- 
pansion (while X cannot). We now list four examples for 
which this property holds. In the ID-Ising model, is 
of finite-range when Jij couples nearest neighbours only, 
and decays exponentially with |« — j j in the presence of 
longer-range interactions [l9| . Next, consider the 0{ra) 
model, where the binary spins Ui are replaced with m- 
dimensional spins ai of fixed norms, with interactions 
Jij and zero fields. The model is exactly solvable in the 



(diagonal elements Ja enforce the constraints on |cri|). If 
J is sparse, so is X~^ i even if all correlations are strong. 
In liquid theory, the Ornstein-Zernike direct correlation 
function, a quantity closel y re lated to is widely be- 
lieved to be short-range [20|; this property is used in 
closure schemes, e.g. Percus-Yevick, to obtain the equa- 
tion of state. Even at the critical point of a ferromag- 
net ll| the response of the field hi to changes in the 
magnetizations mj of spins at distance larger than R, 
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quickly decays with R [2C 



Intuitively, the 0{N'^) correlations contain a highly re- 
dundant information about the 0{N) non-zero couplings 
which have generated them. This redundancy is at the 
origin of the 'locality' of and of the cancellation prop- 
erty of the cluster-entropies. 

We thank D. Chatenay, D. Huse, J. Lebowitz, S. Leibler, 
A. Montanari and V. Sessak for very useful discussions. 
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